Fix the condition for calling update_learning_rates #7032
Conversation
Codecov Report
@@           Coverage Diff            @@
##           master    #7032    +/-   ##
========================================
- Coverage      92%      88%      -4%
========================================
  Files         196      196
  Lines       12835    12833       -2
========================================
- Hits        11832    11268     -564
- Misses       1003     1565     +562
Can you write a test that fails on current master without the fix?
I'm sorry, but I have no idea how to write a test for this issue... Even if …
@tkng You must have observed this bug somehow in your code, right? If you provide a minimal example (see our bug_report_model), we can easily help you convert it into a good regression test :)
@awaelchli thanks, I wrote a minimal example and uploaded it to https://gist.github.com/tkng/a89ffdab2558aedc04a67f7142e6aeb0. (Sorry, I noticed bug_report_model.py only after I wrote this example, so the code is based on lightning-flash.) Once you run this example and open the log directory with TensorBoard, you'll see the learning rate decay: the pink curve is with the current pytorch-lightning, the green one is with the patch attached to this PR.
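For readers without access to the gist, a hedged sketch of what such a minimal reproduction could look like, using the plain BoringModel pattern rather than lightning-flash; the model, scheduler, and Trainer settings here are illustrative assumptions, not the exact gist:

```python
import torch
from torch.utils.data import DataLoader, Dataset
from pytorch_lightning import LightningModule, Trainer
from pytorch_lightning.callbacks import LearningRateMonitor


class RandomDataset(Dataset):
    def __init__(self, size, length):
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return len(self.data)


class BoringModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("train_loss", loss)
        return loss

    def validation_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("valid_loss", loss)

    def configure_optimizers(self):
        optimizer = torch.optim.SGD(self.layer.parameters(), lr=0.1)
        # epoch-level scheduler: expected to step once per training epoch
        scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.9)
        return [optimizer], [scheduler]


if __name__ == "__main__":
    model = BoringModel()
    trainer = Trainer(
        max_epochs=10,
        check_val_every_n_epoch=2,  # with the bug, the LR only decays on epochs where validation runs
        limit_train_batches=2,
        limit_val_batches=2,
        callbacks=[LearningRateMonitor(logging_interval="epoch")],  # logs the LR to TensorBoard
    )
    trainer.fit(model, DataLoader(RandomDataset(32, 64)), DataLoader(RandomDataset(32, 64)))
```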
Thanks. Based on your description and code, I pushed a minimal test that fails on master.
Why is the learning rate update dependent on the validation loop at all?
I suspect because of this scheduler
lgtm
well, the solution here is a bit incorrect. `val_loop_called` actually tells whether the val loop was called within the training loop or not, for the cases where `val_check_interval` (float) < 1. or `val_check_interval` < num_train_batches. In such a case `val_loop_called` will be True, and thus we need to explicitly call the epoch-level schedulers outside the train loop if no epoch-end validation is going to happen. I have updated a test in #7084 which will fail with your changes.

Epoch-level schedulers also update within the evaluation loop because we need to ensure they get updated after the validation loop and right before checkpointing, so that the correct scheduler state dict is saved in the checkpoint file. I know this is a bit complicated, but most of the complexity is due to the ReduceLROnPlateau scheduler, which needs a monitor to update itself. This monitor can belong to anything logged in either train or val.

To handle the cases where `check_val_every_n_epoch` > 1, I think we need to somehow pull the scheduler update out of the val loop and still ensure that the correct scheduler state_dict gets saved in the checkpoint file (we have tests for that), and of course it should work with ReduceLROnPlateau (I think we have tests for that too).
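As an illustration of the ReduceLROnPlateau coupling described above, a hedged sketch of a `configure_optimizers` that ties the scheduler to a logged monitor; the metric name "val_loss" and the optimizer choice are placeholders, not code from this PR:

```python
def configure_optimizers(self):
    optimizer = torch.optim.Adam(self.parameters(), lr=1e-3)
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="min")
    return {
        "optimizer": optimizer,
        "lr_scheduler": {
            "scheduler": scheduler,
            # ReduceLROnPlateau cannot step blindly: it needs the value of a logged
            # metric, so Lightning must step it after that metric is available --
            # which is why the epoch-level update is entangled with the validation
            # loop and with saving the scheduler state into checkpoints.
            "monitor": "val_loss",  # must be logged, e.g. self.log("val_loss", loss)
            "interval": "epoch",
            "frequency": 1,
        },
    }
```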
@rohitgr7 Hmm, the problem seems more complicated than I had thought... May I assume that `update_learning_rates` should always be called at the end of one training epoch, even for ReduceLROnPlateau?

If that assumption is correct, then I think the right fix would be to call `update_learning_rates` immediately after the training process for one epoch ends, in the `run_training_epoch` method of the `TrainLoop` class. I think this understanding is consistent with your statement: "I think we need to somehow pull out the scheduler update from val_loop and still need to ensure that correct scheduler state_dict gets updated within the checkpoint file."

I tried to write such code once, but I could not make the unit tests pass. On the other hand, if I discard the above understanding and just consider passing the unit tests, I only need to change one line, as I just pushed. (Just to be sure, this also passes the test from #7084.) However, I am not sure whether my current code is correct. 🤔
yeap |
Converting to draft so merging is blocked until these comments are addressed 👍 |
@tkng how is it going here, still WIP or ready to go? 🐰 |
@Borda This pull request passes all current unit tests, but according to @rohitgr7, it is incompatible with ReduceLROnPlateau. I have tried to rewrite this PR to be compatible with ReduceLROnPlateau, but I failed to write code that passes all tests and still works with ReduceLROnPlateau. I think we have two options.
I am OK with either, but we'll need an opinion from @rohitgr7.
I tried something, but couldn't find an optimal solution that can cover all the cases because of this. So I'd say merge this one.
    max_epochs=epochs,
)
trainer.fit(model)
assert mocked_sched.call_count == expected_steps
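For context, a hedged sketch of what a regression test along these lines could look like; the test name, patch target, import path, and expected counts are assumptions, not necessarily the exact test pushed to this branch:

```python
from unittest import mock

import pytest
from pytorch_lightning import Trainer
from tests.helpers.boring_model import BoringModel  # assumed path to the repo's BoringModel helper


@pytest.mark.parametrize("check_val_every_n_epoch", [1, 2])
def test_lr_scheduler_steps_every_epoch(tmpdir, check_val_every_n_epoch):
    epochs = 4
    # StepLR.step() is called once at scheduler construction, then once per training epoch
    expected_steps = epochs + 1
    model = BoringModel()  # its configure_optimizers uses an epoch-level StepLR
    with mock.patch("torch.optim.lr_scheduler.StepLR.step") as mocked_sched:
        trainer = Trainer(
            default_root_dir=tmpdir,
            check_val_every_n_epoch=check_val_every_n_epoch,
            limit_train_batches=2,
            limit_val_batches=2,
            max_epochs=epochs,
        )
        trainer.fit(model)
    # with the fix, the epoch-level scheduler steps every epoch,
    # not only on the epochs where validation runs
    assert mocked_sched.call_count == expected_steps
```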
noob question: why do we want to update schedulers every epoch even if we only run validation on some of them?
Running validation only on some epochs is just a choice. Ideally, a scheduler should update after every epoch if `scheduler['frequency'] == 1` (the default) and `scheduler['interval'] == 'epoch'`. If someone wants to align it with validation, they can set `scheduler['frequency'] = check_val_every_n_epoch`, as sketched below.
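A hedged illustration of the scheduler dictionary keys being discussed, written as a fragment of a LightningModule's `configure_optimizers`; the scheduler choice and the values are placeholders:

```python
def configure_optimizers(self):
    optimizer = torch.optim.SGD(self.parameters(), lr=0.1)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.9)
    return {
        "optimizer": optimizer,
        "lr_scheduler": {
            "scheduler": scheduler,
            "interval": "epoch",  # step the scheduler at epoch boundaries
            "frequency": 1,       # default: step every epoch, independent of validation
            # to align scheduler steps with validation instead, set e.g.
            # "frequency": 2 when using Trainer(check_val_every_n_epoch=2)
        },
    }
```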
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
What does this PR do?
This pull request fixes the condition for `update_learning_rates` to be called in `training_loop.py`.

`update_learning_rates` in the `OptimizerConnector` class is called from several places. In `training_loop.py`, one of the calling conditions is missing a `not`. As a result, when the Trainer's `check_val_every_n_epoch` is set to 2 or more, the learning rate is not updated as expected. For example, if `check_val_every_n_epoch` is set to 2, the learning rate will be updated only once every 2 epochs.

More detail
The current condition for calling `update_learning_rates` is as follows. The comment above the line says "update epoch level lr_schedulers if no val loop outside train loop is triggered", but the current code doesn't work as the comment expects.
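A hedged reconstruction of that condition in `run_training_epoch`, with variable names taken from the discussion above; the exact source may differ:

```python
# update epoch level lr_schedulers if no val loop outside train loop is triggered
if (val_loop_called and not should_check_val) or should_train_only:
    self.trainer.optimizer_connector.update_learning_rates(interval="epoch")
```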
This pull request fixes this issue by inserting `not` before `val_loop_called`.
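With the `not` inserted as described, the condition would read roughly as follows; again a hedged sketch, and the final merged diff may differ after the review discussion above:

```python
# update epoch level lr_schedulers if no val loop outside train loop is triggered
if (not val_loop_called and not should_check_val) or should_train_only:
    self.trainer.optimizer_connector.update_learning_rates(interval="epoch")
```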
Fixes #6616
By the way, we can assume that `update_learning_rates` does not have any harmful side effects (in fact, I checked the current behavior of PyTorch-Lightning by writing a custom learning rate rule myself; currently, it is called twice in one epoch). So I think we can call it every time in the training loop without having to think about anything difficult. However, since I don't understand the whole code base of this project, I kept the amount of changes in this pull request as small as possible.

Before submitting
PR review
Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing, make sure you have read the Review guidelines. In short, see the following bullet list: